An open-source voice type classifier for child-centered daylong recordings
Spontaneous conversations in real-world settings such as those found in
child-centered recordings have been shown to be amongst the most challenging
audio files to process. Nevertheless, building speech processing models
that handle such a wide variety of conditions would be particularly useful for
language acquisition studies in which researchers are interested in the
quantity and quality of the speech that children hear and produce, as well as
for early diagnosis and measuring effects of remediation. In this paper, we
present our approach to designing an open-source neural network to classify
audio segments into vocalizations produced by the child wearing the recording
device, vocalizations produced by other children, adult male speech, and adult
female speech. To this end, we gathered diverse child-centered corpora which
together amount to 260 hours of recordings and cover 10 languages. The output
of our model can be used for downstream tasks such as estimating the number
of words produced by adult speakers or the number of linguistic units produced
by children. Our architecture combines SincNet filters with a stack of
recurrent layers and outperforms by a large margin the state-of-the-art
Language ENvironment Analysis (LENA) system, which has been used in numerous
child language studies.
Comment: accepted to Interspeech 202
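The abstract notes that the classifier's segment-level output feeds downstream measures such as how much speech children hear from each voice type. A minimal sketch of that aggregation step, with made-up segments and hypothetical class labels (KCHI for the key child, FEM/MAL for adult female/male speech):

```python
# Hypothetical sketch: aggregating voice-type segments into per-class
# speech totals. Segments and label names (KCHI, FEM, MAL) are
# illustrative, not the model's actual output format.
from collections import defaultdict

# Each segment: (start_sec, end_sec, voice_type_label).
segments = [
    (0.0, 1.5, "KCHI"),
    (2.0, 4.0, "FEM"),
    (4.0, 4.5, "MAL"),
    (5.0, 6.0, "KCHI"),
]

def speech_duration_by_class(segments):
    """Sum the duration of speech attributed to each voice type."""
    totals = defaultdict(float)
    for start, end, label in segments:
        totals[label] += end - start
    return dict(totals)

print(speech_duration_by_class(segments))
# {'KCHI': 2.5, 'FEM': 2.0, 'MAL': 0.5}
```

Per-class totals like these are the kind of quantity that word-count or linguistic-unit estimators would then consume.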
Probing phoneme, language and speaker information in unsupervised speech representations
Unsupervised models of representations based on Contrastive Predictive Coding (CPC) [1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting that CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffuse across all features. These patterns hold when the analyses are carried out on the discrete units from a downstream clustering model. However, although the number of target clusters has no effect on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.
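The quantitative probing described above amounts to training a simple classifier on frozen representations and checking whether a property (phone class, gender, language) is recoverable. The paper's actual probes are not specified here, so the following is a toy nearest-centroid probe on made-up 2-D "representations", purely to illustrate the idea:

```python
# Hypothetical sketch of a probing classifier: a nearest-centroid probe
# trained on toy frame representations to predict a property such as
# gender or phone class. Features and labels are invented.
import math
from collections import defaultdict

def train_centroid_probe(features, labels):
    """Average the feature vectors of each class to form centroids."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for x, y in zip(features, labels):
        sums[y] = list(x) if sums[y] is None else [a + b for a, b in zip(sums[y], x)]
        counts[y] += 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, x):
    """Assign x to the class with the nearest centroid (Euclidean)."""
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))

# Toy 2-D "representations": one cloud per class.
feats = [(0.0, 0.1), (0.1, 0.0), (1.0, 1.1), (1.1, 1.0)]
labs = ["A", "A", "B", "B"]
probe = train_centroid_probe(feats, labs)
print(predict(probe, (0.05, 0.05)))  # "A"
```

High probe accuracy on held-out frames is then read as evidence that the probed property is encoded in the representations.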
Enregistrements de longue durée: Opportunités et défis (Long-form recordings: opportunities and challenges)
Technological advances have enabled the development of lightweight, wearable recorders that collect audio (including speech) lasting up to a whole day. We provide a general description of the technique and lay out the advantages and drawbacks of this methodology. Field linguists may gain a uniquely naturalistic viewpoint of language use as people go about their everyday activities. However, due to their duration, noisiness, and likelihood of containing sensitive information, long-form recordings remain difficult to annotate manually. Open-source tools improve reproducibility and ease of use for researchers, to which end speech technologists can contribute. Additionally, new approaches to human and automated annotation make the study of speech in long-form recordings increasingly feasible and promising.
BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models
Self-supervised techniques for learning speech representations have been
shown to develop linguistic competence from exposure to speech without the need
for human labels. In order to fully realize the potential of these approaches
and further our understanding of how infants learn language, simulations must
closely emulate real-life situations by training on developmentally plausible
corpora and benchmarking against appropriate test sets. To this end, we propose
a language-acquisition-friendly benchmark to probe spoken language models at
the lexical and syntactic levels, both of which are compatible with the
vocabulary typical of children's language experiences. This paper introduces
the benchmark and summarizes a range of experiments showing its usefulness. In
addition, we highlight two exciting challenges that need to be addressed for
further progress: bridging the gap between text and speech and between clean
speech and in-the-wild speech.
Comment: Proceedings of Interspeech 202
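Lexical benchmarks of this kind typically present a model with matched real-word/nonword pairs and count how often the real word receives the higher score. The exact BabySLM scoring protocol is not given here, so this is a generic sketch with invented log-probability scores:

```python
# Hypothetical sketch of a "spot-the-word" lexical metric: accuracy is
# the fraction of real-word/nonword pairs where the real word gets the
# higher (pseudo-)log-probability. All scores below are made up.
pairs = [
    ("baby", -4.2, "bavy", -6.1),
    ("milk", -3.8, "nilk", -5.0),
    ("dog",  -4.5, "dag",  -4.1),  # model prefers the nonword here
]

def lexical_accuracy(pairs):
    """Fraction of pairs where the real word outscores the nonword."""
    correct = sum(1 for _, real_lp, _, fake_lp in pairs if real_lp > fake_lp)
    return correct / len(pairs)

print(lexical_accuracy(pairs))  # 2 of 3 pairs correct
```

A syntactic subtask can be scored the same way, with grammatical/ungrammatical sentence pairs in place of word/nonword pairs.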
ProsAudit, a prosodic benchmark for self-supervised speech models
We present ProsAudit, a benchmark in English to assess structural prosodic
knowledge in self-supervised learning (SSL) speech models. It consists of two
subtasks, their corresponding metrics, and an evaluation dataset. In the
protosyntax task, the model must correctly identify strong versus weak prosodic
boundaries. In the lexical task, the model needs to correctly distinguish
between pauses inserted between words and within words. We also provide human
evaluation scores on this benchmark. We evaluated a series of SSL models and
found that they were all able to perform above chance on both tasks, even when
evaluated on an unseen language. However, non-native models performed
significantly worse than native ones on the lexical task, highlighting the
importance of lexical knowledge in this task. We also found a clear effect of
size, with models trained on more data performing better on the two subtasks.
Comment: Accepted at Interspeech 2023. 4 pages + references, 1 figure
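The lexical task contrasts pauses inserted between words with pauses inserted within words. The benchmark's actual stimulus construction is not detailed here, but the idea can be sketched with a hypothetical `<pause>` token:

```python
# Hypothetical sketch of lexical-task stimulus construction: a pause is
# inserted either at a word boundary (natural) or inside a word
# (unnatural), yielding the minimal pairs a model is scored on.
# The utterance and the "<pause>" token are illustrative.
def insert_pause_between(words, i, pause="<pause>"):
    """Insert a pause token at the boundary after words[i]."""
    return words[: i + 1] + [pause] + words[i + 1 :]

def insert_pause_within(words, i, split_at, pause="<pause>"):
    """Insert a pause inside words[i], splitting it at position split_at."""
    w = words[i]
    return words[:i] + [w[:split_at], pause, w[split_at:]] + words[i + 1 :]

utt = ["the", "little", "dog"]
print(insert_pause_between(utt, 0))    # ['the', '<pause>', 'little', 'dog']
print(insert_pause_within(utt, 1, 3))  # ['the', 'lit', '<pause>', 'tle', 'dog']
```

A model with lexical knowledge should find the between-word variant more probable than its within-word counterpart.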
Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation
Most automatic speech processing systems are sensitive to the acoustic
environment, with degraded performance when applied to noisy or reverberant
speech. But how can one tell whether speech is noisy or reverberant? We propose
Brouhaha, a pipeline to simulate audio segments recorded in noisy and
reverberant conditions. We then use the simulated audio to jointly train the
Brouhaha model for voice activity detection, speech-to-noise ratio (SNR) estimation,
and C50 room acoustics prediction. We show how the predicted SNR and C50 values
can be used to investigate and help diagnose errors made by automatic speech
processing tools (such as pyannote.audio for speaker diarization or OpenAI's
Whisper for automatic speech recognition). Both our pipeline and a pretrained
model are open source and shared with the speech community.
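Joint training on the three tasks implies a single shared objective. Brouhaha's actual loss formulation and weights are not given in this abstract, so the following is only an illustrative sketch of a weighted multi-task loss:

```python
# Hypothetical sketch of a multi-task objective: one shared encoder with
# three heads (VAD, SNR, C50), whose per-task losses are combined as a
# weighted sum. Loss values and weights below are made up.
def multitask_loss(vad_loss, snr_loss, c50_loss, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-task losses, as in joint training."""
    w_vad, w_snr, w_c50 = weights
    return w_vad * vad_loss + w_snr * snr_loss + w_c50 * c50_loss

print(multitask_loss(0.3, 1.2, 0.9))  # 2.4
```

In practice the weights balance tasks with different scales (a classification loss for VAD versus regression losses for SNR and C50).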
Speaker detection in the wild: Lessons learned from JSALT 2019
Submitted to ICASSP 2020
This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions, from meetings to speech in the wild. We describe the research threads we explored and a set of modules that proved successful in these scenarios. The ultimate goal was to explore speaker detection; but our first finding was that effective diarization improves detection, and omitting the diarization stage degrades performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a preliminary stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of previous stages on the final speaker detection. In this paper, we show partial results for speaker diarization to provide a better understanding of the problem, and we present the final results for speaker detection.
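The "diarization before detection" backbone described above can be pictured as: cluster per-segment speaker embeddings first, then compare each cluster to an enrollment embedding. The workshop's actual systems are far richer; this is a toy sketch with invented 2-D embeddings and a cosine-similarity threshold:

```python
# Hypothetical sketch of detection on top of diarization: each cluster's
# centroid is compared to an enrollment embedding with cosine similarity.
# Embeddings, clusters, and the threshold are all illustrative.
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def detect_speaker(clusters, enrollment, threshold=0.9):
    """Return indices of clusters whose centroid matches the enrollment."""
    hits = []
    for i, members in enumerate(clusters):
        centroid = [sum(dim) / len(members) for dim in zip(*members)]
        if cosine(centroid, enrollment) >= threshold:
            hits.append(i)
    return hits

# Two clusters produced by a (hypothetical) diarization stage.
clusters = [
    [(1.0, 0.1), (0.9, 0.0)],  # cluster 0: close to the enrolled speaker
    [(0.0, 1.0), (0.1, 0.9)],  # cluster 1: a different speaker
]
print(detect_speaker(clusters, enrollment=(1.0, 0.0)))  # [0]
```

Pooling segments into clusters before scoring is what makes detection robust when individual segments are short or noisy, which matches the paper's finding that skipping diarization hurts detection.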